Orthographic Disambiguation Incorporating Transliterated Probability
نویسندگان
چکیده
Orthographic variance is a fundamental problem for many natural language processing applications. The Japanese language, in particular, contains many orthographic variants for two main reasons: (1) transliterated words allow many possible spelling variations, and (2) many characters in Japanese nouns can be omitted or substituted. Previous studies have mainly focused on the former problem; in contrast, this study has addressed both problems using the same framework. First, we automatically collected both positive examples (sets of equivalent term pairs) and negative examples (sets of inequivalent term pairs). Then, by using both sets of examples, a support vector machine based classifier determined whether two terms (t1 and t2) were equivalent. To boost accuracy, we added a transliterated probability P (t1|s)P (t2|s), which is the probability that both terms (t1 and t2) were transliterated from the same source term (s), to the machine learning features. Experimental results yielded high levels of accuracy, demonstrating the feasibility of the proposed approach.
منابع مشابه
Hindi-to-Urdu Machine Translation through Transliteration
We present a novel approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both transliteration and translation when translating a particular Hindi word given the context whereas in previous work transliterat...
متن کاملDetecting Transliterated Orthographic Variants via Two Similarity Metrics
We propose a detection method for orthographic variants caused by transliteration in a large corpus. The method employs two similarities. One is string similarity based on edit distance. The other is contextual similarity by a vector space model. Experimental results show that the method performed a 0.889 F-measure in an open test.
متن کاملLexicon-based Orthographic Disambiguation in CJK Intelligent Information Retrieval
The orthographical complexity of Chinese, Japanese and Korean (CJK) poses a special challenge to the developers of computational linguistic tools, especially in the area of intelligent information retrieval. These difficulties are exacerbated by the lack of a standardized orthography in these languages, especially the highly irregular Japanese orthography. This paper focuses on the typology of ...
متن کاملGenerating Paired Transliterated-cognates Using Multiple Pronunciation Characteristics from Web corpora
A novel approach to automatically extracting paired transliterated-cognates from Web corpora is proposed in this paper. One of the most important issues addressed is that of taking multiple pronunciation characteristics into account. Terms from various languages may pronounce very differently. Incorporating the knowledge of word origin may improve the pronunciation accuracy of terms. The accura...
متن کاملUrdu - Roman Transliteration via Finite State Transducers
This paper introduces a two-way Urdu– Roman transliterator based solely on a nonprobabilistic finite state transducer that solves the encountered scriptural issues via a particular architectural design in combination with a set of restrictions. In order to deal with the enormous amount of overgenerations caused by inherent properties of the Urdu script, the transliterator depends on a set of ph...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2008